This week, you'll build on what you learned last week. First, you'll review how to do linear regression with statsmodels and learn how to identify outliers, leverage points, and influential points. Then we'll use the bird data set to do linear regression in multiple variables (multilinear regression), making use of the machine-learning library scikit-learn (https://scikit-learn.org) and the statistical visualization library seaborn (https://seaborn.pydata.org). Finally, you'll learn how to use LDA for more complicated classification problems.
keywords — regression: outliers, leverage points, multilinear regression, logistic regression; classifiers: Linear Discriminant Analysis (LDA)
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
from sklearn import (linear_model, datasets, metrics,
discriminant_analysis)
# conda/pip install scikit-learn if you get import errors
In this first exercise, you're going to practice identifying outliers, leverage points, and influential points. We'll use a simple linear regression with one predictor.
# to-do: create a dataframe by loading the 'numbers.csv' file
numbers = pd.read_csv("../02/data/numbers.csv")
numbers.head()
# create a random sample
np.random.seed(123)
rd_sample = np.random.randint(0, len(numbers), size = 100)
# to-do: perform a linear regression with sm.ols()
x = numbers.iloc[:, rd_sample[0:50]].values.flatten()
y = numbers.iloc[:, rd_sample[50:]].values.flatten()
df = pd.DataFrame({
"x" : x,
"y" : y
})
m1 = sm.ols(formula = "y ~ x", data = df).fit()
m1.summary()
| Dep. Variable: | y | R-squared: | 0.000 |
|---|---|---|---|
| Model: | OLS | Adj. R-squared: | -0.000 |
| Method: | Least Squares | F-statistic: | 0.3446 |
| Date: | Sun, 21 Mar 2021 | Prob (F-statistic): | 0.557 |
| Time: | 14:42:11 | Log-Likelihood: | -70814. |
| No. Observations: | 50000 | AIC: | 1.416e+05 |
| Df Residuals: | 49998 | BIC: | 1.416e+05 |
| Df Model: | 1 | | |
| Covariance Type: | nonrobust | | |
| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 0.0018 | 0.004 | 0.401 | 0.689 | -0.007 | 0.011 |
| x | 0.0026 | 0.004 | 0.587 | 0.557 | -0.006 | 0.011 |
| Omnibus: | 2.458 | Durbin-Watson: | 2.014 |
|---|---|---|---|
| Prob(Omnibus): | 0.293 | Jarque-Bera (JB): | 2.438 |
| Skew: | 0.007 | Prob(JB): | 0.296 |
| Kurtosis: | 2.968 | Cond. No. | 1.00 |
# to-do: calculate the leverage and plot the fit against the residuals
n = len(x)
fitted = m1.params["Intercept"] + x * m1.params["x"]
# leverage for simple linear regression: h_i = 1/n + (x_i - x_bar)^2 / sum_j (x_j - x_bar)^2
h = 1/n + (x - np.mean(x))**2 / np.sum((x - np.mean(x))**2)
resid = y - fitted
std_resid = resid / np.std(resid)  # standardized residuals
plt.figure()
plt.scatter(h, std_resid)
plt.axhline(0, linestyle="--")
plt.xlabel("Leverage")
plt.ylabel("Std. residuals")
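Instead of computing the leverage by hand, statsmodels can produce these diagnostics directly from the fitted model's influence object. A minimal sketch on a small synthetic dataset (the `demo` data frame here is made up for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.normal(size=50)})
demo["y"] = 2 * demo["x"] + rng.normal(size=50)

m = smf.ols("y ~ x", data=demo).fit()
infl = m.get_influence()
leverage = infl.hat_matrix_diag             # h_ii, diagonal of the hat matrix
stud_resid = infl.resid_studentized_internal  # residuals scaled by their own variance
cooks_d = infl.cooks_distance[0]            # influence = leverage x residual size

# rule of thumb: flag points whose leverage exceeds 2 * (p + 1) / n
high_leverage = leverage > 2 * 2 / len(demo)
```

The leverages always sum to the number of fitted parameters (here 2), which is a handy sanity check.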
In this second exercise, you'll be working with the bird dataset from last week (bird_data_vincze_etal_2015.csv) to try to predict migration distances.
The usage of linear_model.LinearRegression differs slightly from what you've used before. Here's how it works:
# to-do: load the data into a pandas dataframe and see if it requires any cleaning
bird = pd.read_csv("data/bird_data_vincze_etal_2015.csv")
print(bird.head())  # note the parentheses: bird.head is the method itself
bird = bird.dropna() # removing NAs
sns.set(rc={"figure.figsize" : (10,10)})
sns.pairplot(bird) # most variables skewed
                   Species  Migration distance  Body mass  Brain mass  \
0       Accipiter gentilis               542.0     1100.0       7.674
1          Accipiter nisus              2938.1      260.0       3.081
2  Acrocephalus scirpaceus              3577.0       14.0       0.483
3      Aegithalos caudatus                26.1        7.5       0.456
4        Aegypius monachus               972.2     9000.0      24.808

   Size of cerebellum  Size of telencephalon  Size of optic lobe  \
0               1.088                  4.617               0.837
1               0.499                  1.593               0.440
2               0.066                  0.282               0.070
3               0.035                  0.283               0.059
4               2.812                 17.936               1.020

   Wing aspect ratio  Wing area
0              5.841      0.193
1              5.891      0.093
2              4.835      0.008
3              4.463      0.007
4                NaN        NaN
# to-do: use help() or look up the function online
help(linear_model.LinearRegression)
# ask for help if you don't understand how to use this function
Help on class LinearRegression in module sklearn.linear_model._base:
class LinearRegression(sklearn.base.MultiOutputMixin, sklearn.base.RegressorMixin, LinearModel)
| LinearRegression(*, fit_intercept=True, normalize=False, copy_X=True, n_jobs=None, positive=False)
|
| Ordinary least squares Linear Regression.
|
| LinearRegression fits a linear model with coefficients w = (w1, ..., wp)
| to minimize the residual sum of squares between the observed targets in
| the dataset, and the targets predicted by the linear approximation.
|
| Parameters
| ----------
| fit_intercept : bool, default=True
| Whether to calculate the intercept for this model. If set
| to False, no intercept will be used in calculations
| (i.e. data is expected to be centered).
|
| normalize : bool, default=False
| This parameter is ignored when ``fit_intercept`` is set to False.
| If True, the regressors X will be normalized before regression by
| subtracting the mean and dividing by the l2-norm.
| If you wish to standardize, please use
| :class:`~sklearn.preprocessing.StandardScaler` before calling ``fit``
| on an estimator with ``normalize=False``.
|
| copy_X : bool, default=True
| If True, X will be copied; else, it may be overwritten.
|
| n_jobs : int, default=None
| The number of jobs to use for the computation. This will only provide
| speedup for n_targets > 1 and sufficient large problems.
| ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
| ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
| for more details.
|
| positive : bool, default=False
| When set to ``True``, forces the coefficients to be positive. This
| option is only supported for dense arrays.
|
| .. versionadded:: 0.24
|
| Attributes
| ----------
| coef_ : array of shape (n_features, ) or (n_targets, n_features)
| Estimated coefficients for the linear regression problem.
| If multiple targets are passed during the fit (y 2D), this
| is a 2D array of shape (n_targets, n_features), while if only
| one target is passed, this is a 1D array of length n_features.
|
| rank_ : int
| Rank of matrix `X`. Only available when `X` is dense.
|
| singular_ : array of shape (min(X, y),)
| Singular values of `X`. Only available when `X` is dense.
|
| intercept_ : float or array of shape (n_targets,)
| Independent term in the linear model. Set to 0.0 if
| `fit_intercept = False`.
|
| See Also
| --------
| Ridge : Ridge regression addresses some of the
| problems of Ordinary Least Squares by imposing a penalty on the
| size of the coefficients with l2 regularization.
| Lasso : The Lasso is a linear model that estimates
| sparse coefficients with l1 regularization.
| ElasticNet : Elastic-Net is a linear regression
| model trained with both l1 and l2 -norm regularization of the
| coefficients.
|
| Notes
| -----
| From the implementation point of view, this is just plain Ordinary
| Least Squares (scipy.linalg.lstsq) or Non Negative Least Squares
| (scipy.optimize.nnls) wrapped as a predictor object.
|
| Examples
| --------
| >>> import numpy as np
| >>> from sklearn.linear_model import LinearRegression
| >>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
| >>> # y = 1 * x_0 + 2 * x_1 + 3
| >>> y = np.dot(X, np.array([1, 2])) + 3
| >>> reg = LinearRegression().fit(X, y)
| >>> reg.score(X, y)
| 1.0
| >>> reg.coef_
| array([1., 2.])
| >>> reg.intercept_
| 3.0000...
| >>> reg.predict(np.array([[3, 5]]))
| array([16.])
|
| Method resolution order:
| LinearRegression
| sklearn.base.MultiOutputMixin
| sklearn.base.RegressorMixin
| LinearModel
| sklearn.base.BaseEstimator
| builtins.object
|
| Methods defined here:
|
| __init__(self, *, fit_intercept=True, normalize=False, copy_X=True, n_jobs=None, positive=False)
| Initialize self. See help(type(self)) for accurate signature.
|
| fit(self, X, y, sample_weight=None)
| Fit linear model.
|
| Parameters
| ----------
| X : {array-like, sparse matrix} of shape (n_samples, n_features)
| Training data
|
| y : array-like of shape (n_samples,) or (n_samples, n_targets)
| Target values. Will be cast to X's dtype if necessary
|
| sample_weight : array-like of shape (n_samples,), default=None
| Individual weights for each sample
|
| .. versionadded:: 0.17
| parameter *sample_weight* support to LinearRegression.
|
| Returns
| -------
| self : returns an instance of self.
|
| ----------------------------------------------------------------------
| Data and other attributes defined here:
|
| __abstractmethods__ = frozenset()
|
| ----------------------------------------------------------------------
| Data descriptors inherited from sklearn.base.MultiOutputMixin:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
|
| ----------------------------------------------------------------------
| Methods inherited from sklearn.base.RegressorMixin:
|
| score(self, X, y, sample_weight=None)
| Return the coefficient of determination :math:`R^2` of the
| prediction.
|
| The coefficient :math:`R^2` is defined as :math:`(1 - \frac{u}{v})`,
| where :math:`u` is the residual sum of squares ``((y_true - y_pred)
| ** 2).sum()`` and :math:`v` is the total sum of squares ``((y_true -
| y_true.mean()) ** 2).sum()``. The best possible score is 1.0 and it
| can be negative (because the model can be arbitrarily worse). A
| constant model that always predicts the expected value of `y`,
| disregarding the input features, would get a :math:`R^2` score of
| 0.0.
|
| Parameters
| ----------
| X : array-like of shape (n_samples, n_features)
| Test samples. For some estimators this may be a precomputed
| kernel matrix or a list of generic objects instead with shape
| ``(n_samples, n_samples_fitted)``, where ``n_samples_fitted``
| is the number of samples used in the fitting for the estimator.
|
| y : array-like of shape (n_samples,) or (n_samples, n_outputs)
| True values for `X`.
|
| sample_weight : array-like of shape (n_samples,), default=None
| Sample weights.
|
| Returns
| -------
| score : float
| :math:`R^2` of ``self.predict(X)`` wrt. `y`.
|
| Notes
| -----
| The :math:`R^2` score used when calling ``score`` on a regressor uses
| ``multioutput='uniform_average'`` from version 0.23 to keep consistent
| with default value of :func:`~sklearn.metrics.r2_score`.
| This influences the ``score`` method of all the multioutput
| regressors (except for
| :class:`~sklearn.multioutput.MultiOutputRegressor`).
|
| ----------------------------------------------------------------------
| Methods inherited from LinearModel:
|
| predict(self, X)
| Predict using the linear model.
|
| Parameters
| ----------
| X : array-like or sparse matrix, shape (n_samples, n_features)
| Samples.
|
| Returns
| -------
| C : array, shape (n_samples,)
| Returns predicted values.
|
| ----------------------------------------------------------------------
| Methods inherited from sklearn.base.BaseEstimator:
|
| __getstate__(self)
|
| __repr__(self, N_CHAR_MAX=700)
| Return repr(self).
|
| __setstate__(self, state)
|
| get_params(self, deep=True)
| Get parameters for this estimator.
|
| Parameters
| ----------
| deep : bool, default=True
| If True, will return the parameters for this estimator and
| contained subobjects that are estimators.
|
| Returns
| -------
| params : dict
| Parameter names mapped to their values.
|
| set_params(self, **params)
| Set the parameters of this estimator.
|
| The method works on simple estimators as well as on nested objects
| (such as :class:`~sklearn.pipeline.Pipeline`). The latter have
| parameters of the form ``<component>__<parameter>`` so that it's
| possible to update each component of a nested object.
|
| Parameters
| ----------
| **params : dict
| Estimator parameters.
|
| Returns
| -------
| self : estimator instance
| Estimator instance.
# to-do: use multilinear regression to predict the migration distance and
# discuss whether your prediction is any good
X = bird.drop(["Migration distance", "Species"], axis = 1).values
Y = bird.loc[:, bird.columns == "Migration distance"].values
reg = linear_model.LinearRegression().fit(X, y = Y)
print("parameters: ", reg.coef_)
print("R-squared: ", reg.score(X, y = Y))
pred = reg.predict(X)
mse = np.mean((pred - Y)**2)
print("MSE: ", mse) ## pretty bad...
parameters:  [[ 6.32176066e-01 -7.84027065e+02 -3.99147493e+02  5.76379428e+02
   9.32376011e+02  1.04270649e+03  1.49547595e+03]]
R-squared:  0.30482040064955884
MSE:  4137159.031044556
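Since the pairplot showed that most variables are right-skewed, one thing worth trying (an extra idea, not part of the exercise) is log-transforming skewed predictors before fitting. A sketch on synthetic data, where the true relationship is linear in log(x):

```python
import numpy as np
from sklearn import linear_model

rng = np.random.default_rng(2)
x = rng.lognormal(mean=0, sigma=1, size=(300, 1))        # right-skewed predictor
y = 3 * np.log(x[:, 0]) + rng.normal(scale=0.5, size=300)

raw = linear_model.LinearRegression().fit(x, y)
logged = linear_model.LinearRegression().fit(np.log(x), y)
print("raw R^2:   ", raw.score(x, y))
print("logged R^2:", logged.score(np.log(x), y))  # noticeably higher
```

Whether this helps with the bird data depends on the actual shape of the relationships, but it is a cheap experiment.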
# optional to-do: look for collinearities in the predictors
corrMat = bird.drop(["Migration distance", "Species"], axis = 1).corr()
sns.heatmap(corrMat, annot=True)  # many variables are correlated with each other
In the lecture you've learned about logistic regression, which is the appropriate tool when the dependent variable is dichotomous (binary). We're going to apply this tool to the iris flower dataset (https://en.wikipedia.org/wiki/Iris_flower_data_set) and see whether we can classify flowers correctly.
The iris dataset is included in scikit-learn.
iris = datasets.load_iris()
dir(iris)
['DESCR', 'data', 'feature_names', 'filename', 'frame', 'target', 'target_names']
This function returns a dictionary-like object with several fields. All the data points are contained in 'data'. The four columns correspond to the following four features.
iris.data.shape
(150, 4)
iris.feature_names
['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
For every data point, the respective class label is stored in target. There is a total of three classes.
iris.target.shape
(150,)
iris.target_names
array(['setosa', 'versicolor', 'virginica'], dtype='<U10')
As usual, it's a good idea to have a look at the data before starting with your analysis. Let's do some pairplots with seaborn.
df = pd.DataFrame(iris.data) # create a dataframe from the iris data
cols = [name for name in iris.feature_names] # define the column names
df.columns = cols # set the column names
# alternative way:
# df = pd.DataFrame(dict(zip(iris.feature_names, np.transpose(iris.data))))
df['species'] = iris.target # add the species column
g = sns.pairplot(df, vars=cols, hue='species', palette='colorblind')
We want to focus on binary classification, and hence we discard all data points belonging to a certain class. This can be done by, for example, selecting all the rows that do not belong to that class. Here, we drop the class 'virginica'.
idcs = iris.target != 2 # get all the indices where the class is not 2
data = iris.data[idcs].astype(np.float32)
target = iris.target[idcs].astype(int)  # use the builtin int; np.int is deprecated
# to-do: use one of the flower features to build a logistic regression
X = data[:, 2]  # petal length as predictor
X = np.reshape(X, (-1, 1))  # LogisticRegression expects a 2D array
log_reg = linear_model.LogisticRegression()
log_reg.fit(X=X, y=target)
prob = log_reg.predict_proba(X)
print(prob)
# predict to which of the two classes (setosa or versicolor) each flower belongs
pred = log_reg.predict(X)
conf_matrix = metrics.confusion_matrix(y_true=target, y_pred=pred)
vis = metrics.ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=iris.target_names[0:2])
vis.plot()
print("Accuracy: ", metrics.accuracy_score(y_true=target, y_pred=pred))
[[0.97866264 0.02133736]
 [0.97866264 0.02133736]
 [0.98394781 0.01605219]
 ...
 [0.30698153 0.69301847]
 [0.01791101 0.98208899]]
Accuracy:  1.0
# to-do: visualize your result by using the predict_proba method of the logistic regression classifier
# constructing the probability regression curve
grid = np.linspace(X.min(), X.max(), num=100).reshape(-1, 1)  # 2D array for predict_proba
prob_plt = log_reg.predict_proba(grid)[:, 1]
plt.figure(figsize=(10, 10))
plt.plot(grid, prob_plt, color="red")
plt.scatter(X, target)
plt.show()
# to-do: have a look at the metrics.accuracy_score method to quantify how good your prediction is
print("Accuracy: ", metrics.accuracy_score(y_true=target, y_pred=pred))
Accuracy: 1.0
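Note that this accuracy is computed on the same data the model was trained on, which is optimistic in general. A held-out split gives a fairer estimate; a sketch using scikit-learn's train_test_split:

```python
import numpy as np
from sklearn import datasets, linear_model, metrics
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
mask = iris.target != 2                       # keep setosa and versicolor only
X = iris.data[mask][:, 2].reshape(-1, 1)      # petal length as predictor
y = iris.target[mask]

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
clf = linear_model.LogisticRegression().fit(X_tr, y_tr)
acc = metrics.accuracy_score(y_te, clf.predict(X_te))
print("Held-out accuracy:", acc)
```

Here even the held-out accuracy stays at 1.0 because the two classes are linearly separable on petal length alone; on harder problems the two numbers will differ.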
We're going to keep working with the iris dataset from the last exercise. We will use the built-in classifier in scikit-learn.
Start off with a scatter plot where the colors indicate the species.
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
sns.set(rc={"figure.figsize" : (10,10)})
sns.scatterplot(x='sepal length (cm)', y='petal length (cm)',
                hue='species', data=df, palette='Set1')
# to-do: use the discriminant_analysis.LinearDiscriminantAnalysis classifier on
# the (full) iris dataset to predict the flower species as a function of sepal/petal length
X = df.loc[:, df.columns != "species"].values
y = df.loc[:, "species"]
lda = discriminant_analysis.LinearDiscriminantAnalysis()
lda.fit(X, y)
print(lda.predict(X))
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1
 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
# challenging to-do: visualize and quantify how well the classifier is working
x = df.loc[:, ["sepal length (cm)", "petal length (cm)"]].values
lda_plt = discriminant_analysis.LinearDiscriminantAnalysis()
lda_plt.fit(X=x, y=y)
colpal = sns.color_palette("Set1", as_cmap=True)
sns.scatterplot(x='sepal length (cm)', y='petal length (cm)',
                hue='species', data=df, palette="Set1")
nx , ny = 100, 100
xmin, xmax = plt.xlim()
ymin, ymax = plt.ylim()
xx, yy = np.meshgrid(
np.linspace(xmin, xmax, nx),
np.linspace(ymin, ymax, ny)
)
z = lda_plt.predict_proba(np.c_[xx.ravel(), yy.ravel()])
z = z[:, 1].reshape(xx.shape)  # probability of class 1 (versicolor)
plt.pcolormesh(xx, yy, z, cmap=colpal, alpha=.5, shading="auto")
plt.contour(xx, yy, z, colors="white")
# confusion matrix
conf_matrix = metrics.confusion_matrix(y_true=y, y_pred=lda.predict(X))
vis = metrics.ConfusionMatrixDisplay(confusion_matrix=conf_matrix, display_labels=iris.target_names)
vis.plot()
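As with logistic regression, the confusion matrix above is computed on the training data. To quantify performance more honestly, cross-validation is a natural next step; a sketch with 5-fold CV on the full iris data:

```python
from sklearn import datasets, discriminant_analysis
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
lda = discriminant_analysis.LinearDiscriminantAnalysis()
scores = cross_val_score(lda, iris.data, iris.target, cv=5)  # one accuracy per fold
print("mean CV accuracy:", scores.mean())
```

LDA separates the iris classes very well, so the mean cross-validated accuracy stays close to the training accuracy here.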